In this notebook, we investigate some of the empirical properties of limit order books (LOBs). This will motivate the models that we choose to use for representing the trading environment that different market participants face.
Level 1 data
Only the data from the touch, i.e. the best bid price $P^b_t$, the best ask price $P^a_t$ and their associated volumes $V^b_t$ and $V^a_t$. This data is very noisy, so it is not really suitable for building a sophisticated trading strategy on its own.
Level 2 data
This gives the order book in its entirety for all levels. In practice you won't use all levels to determine a strategy as the importance of prices and volume drops off the further away you are from the touch.
Level 3 data
This consists of order book data as well as details on the dynamics of the order book: every submission, cancellation and execution. We will see an example of this data (provided by NASDAQ) below. In general it is very expensive to obtain; however, academics can access data of this quality through LOBSTER. This is still expensive, but not anywhere near as expensive as it is for institutions. In particular, the most recent couple of days of data are omitted so that one cannot actively trade off it.
LOBSTER is a limit order book data tool providing easy-to-use, high-quality limit order book data.
Since 2013, LOBSTER has acted as a data provider for the academic community, giving access to reconstructed limit order book data for the entire universe of NASDAQ-traded stocks.
More recently, it has started to make the data available on a commercial basis.
import itertools
import plotly.express as px
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.stats as stats
import seaborn as sns
import statsmodels.graphics.gofplots as sm
from statsmodels.graphics.tsaplots import plot_acf
import sys
sys.path.append("../")
from RL4MM.extras.get_book_trade_dataframe import get_book_trade_dataframe
msft = get_book_trade_dataframe(ticker="MSFT", trading_date="2012-06-21", levels=10)
aapl = get_book_trade_dataframe(ticker="AAPL", trading_date="2012-06-21", levels=10)
Getting data for ticker MSFT on 2012-06-21. Getting data for ticker AAPL on 2012-06-21.
msft.head()
| timestamp | type | external id | size | price | direction | ask price 0 | ask size 0 | bid price 0 | bid size 0 | ask price 1 | ... | bid price 7 | bid size 7 | ask price 8 | ask size 8 | bid price 8 | bid size 8 | ask price 9 | ask size 9 | bid price 9 | bid size 9 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2012-06-21 09:30:00.013994120 | deletion | 16085616 | 100 | 31.04 | sell | 30.99 | 3788 | 30.95 | 300 | 31.05 | ... | 30.86 | 400 | 31.13 | 100 | 30.85 | 400 | 31.14 | 100 | 30.84 | 1600 |
| 2012-06-21 09:30:00.013994120 | submission | 16116348 | 100 | 31.05 | sell | 30.99 | 3788 | 30.95 | 300 | 31.05 | ... | 30.86 | 400 | 31.13 | 100 | 30.85 | 400 | 31.14 | 100 | 30.84 | 1600 |
| 2012-06-21 09:30:00.015247805 | submission | 16116658 | 100 | 31.04 | sell | 30.99 | 3788 | 30.95 | 300 | 31.04 | ... | 30.86 | 400 | 31.11 | 4500 | 30.85 | 400 | 31.13 | 100 | 30.84 | 1600 |
| 2012-06-21 09:30:00.015442111 | submission | 16116704 | 100 | 31.05 | sell | 30.99 | 3788 | 30.95 | 300 | 31.04 | ... | 30.86 | 400 | 31.11 | 4500 | 30.85 | 400 | 31.13 | 100 | 30.84 | 1600 |
| 2012-06-21 09:30:00.015789148 | submission | 16116752 | 100 | 31.06 | sell | 30.99 | 3788 | 30.95 | 300 | 31.04 | ... | 30.86 | 400 | 31.11 | 4500 | 30.85 | 400 | 31.13 | 100 | 30.84 | 1600 |
5 rows × 45 columns
def visualise_LOB(x: pd.Series, order: pd.Series = None):
    lob_cols = list(itertools.chain.from_iterable([[f"ask price {i}", f"ask size {i}", f"bid price {i}", f"bid size {i}"] for i in range(10)]))
    lob_df = x[lob_cols]
    LOB_array = np.array([lob_df[0:40:2], lob_df[1:40:2], ["ask", "bid"] * 10]).transpose()
    lob = pd.DataFrame(LOB_array, columns=["price", "size", "type"]).astype({"price": float, "size": float, "type": str})
    if order is not None:
        # DataFrame.append was removed in pandas 2.0; concat the order as an extra row instead.
        lob = pd.concat([lob, order.to_frame().T])
    fig = px.bar(lob, x='price', y='size', color="type")  # , title=f"Orderbook at {x.name}")
    fig.show()
visualise_LOB(msft.iloc[1548])
visualise_LOB(aapl.iloc[1548])
print("Note that AAPL has a much sparser limit order book than MSFT!")
Note that AAPL has a much sparser limit order book than MSFT!
def visualise_LOB_change(df: pd.DataFrame, iloc: int = 1549):
    order = df[["type", "price", "size"]].iloc[iloc]
    visualise_LOB(df.iloc[iloc - 1])
    if order.type in ["execution_visible", "deletion"]:
        visualise_LOB(df.iloc[iloc], order)
    elif order.type == "submission":
        visualise_LOB(df.iloc[iloc - 1], order)
    else:
        visualise_LOB(df.iloc[iloc])
visualise_LOB_change(msft)
def get_midprices(df:pd.DataFrame):
return (df["ask price 0"]+df["bid price 0"])/2
def get_microprices(df:pd.DataFrame):
imbalance=df["bid size 0"]/(df["bid size 0"]+df["ask size 0"])
return imbalance*df["ask price 0"]+(1-imbalance)*df["bid price 0"]
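As a quick sanity check on these formulas, consider a hypothetical top-of-book snapshot (all numbers here are made up, not taken from the data): with three times as much volume on the bid, the microprice should sit closer to the ask than the midprice does.

```python
import pandas as pd

# Hypothetical top-of-book snapshot: bid 30.95 (300 shares), ask 30.99 (100 shares).
book = pd.DataFrame({
    "bid price 0": [30.95], "bid size 0": [300],
    "ask price 0": [30.99], "ask size 0": [100],
})

midprice = (book["ask price 0"] + book["bid price 0"]) / 2
imbalance = book["bid size 0"] / (book["bid size 0"] + book["ask size 0"])
microprice = imbalance * book["ask price 0"] + (1 - imbalance) * book["bid price 0"]

# With imbalance 0.75, the microprice is 0.75 * 30.99 + 0.25 * 30.95 = 30.98,
# above the midprice of 30.97.
print(midprice.iloc[0], microprice.iloc[0])
```

The microprice can be read as an imbalance-weighted forecast of where the midprice is heading: queue pressure on the bid pushes it toward the ask.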
price_df = pd.DataFrame(aapl["ask price 0"].values, index=aapl.index, columns=["best ask"])
price_df["best bid"] = aapl["bid price 0"]
price_df["midprice"] = get_midprices(aapl)
price_df["microprice"] = get_microprices(aapl)
px.line(price_df.iloc[1000:2000], title="Evolution of the price processes")
midprices = get_midprices(aapl)
midprices_sec=midprices.resample("s").mean()
midprices_min=midprices.resample("T").mean()
sec_returns = 100*midprices_sec.pct_change().dropna()
min_returns = 100*midprices_min.pct_change().dropna()
sns.histplot(sec_returns, stat="probability", bins=20)
plt.title("Distribution of 1 second percentage returns")
plt.show()
sns.histplot(min_returns, stat="probability", bins=20)
plt.title("Distribution of 1 minute percentage returns")
plt.show()
Note that as the aggregation timescale increases, the distribution starts to look more Gaussian. We only have one day of data here but, if we had more, we would see this pattern continue to hours, days, weeks, etc.
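One way to quantify this "aggregational Gaussianity" is to track excess kurtosis as returns are summed over longer windows. A minimal sketch on synthetic heavy-tailed returns (a Student-t sample standing in for the real AAPL series, so the numbers are illustrative only):

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(0)
# Synthetic heavy-tailed "1 second" returns: Student-t with 5 degrees of freedom
# (theoretical excess kurtosis 6).
fine_returns = rng.standard_t(5, size=60_000)

# Aggregate to "1 minute" returns by summing non-overlapping blocks of 60.
coarse_returns = fine_returns.reshape(-1, 60).sum(axis=1)

# Excess kurtosis is 0 for a Gaussian; it shrinks as returns are aggregated (CLT).
k_fine = kurtosis(fine_returns)
k_coarse = kurtosis(coarse_returns)
print(f"excess kurtosis: 1s = {k_fine:.2f}, 1min = {k_coarse:.2f}")
```

Applying `kurtosis` to `sec_returns` and `min_returns` above would show the same effect on the real data.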
fig, ax = plt.subplots(1, 2, figsize=(20, 7))
sm.qqplot(sec_returns.values, line='s', ax = ax[0])
ax[0].set(title='1 second returns Q-Q plot',ylabel='Returns')
sm.qqplot(min_returns.values, line='s', ax = ax[1])
ax[1].set(title='1 minute returns Q-Q plot',ylabel='Returns')
plt.show()
from statsmodels.tsa.stattools import acf, pacf
import plotly.graph_objects as go
def create_corr_plot(series, plot_pacf=False, title='Autocorrelation (ACF)'):
corr_array = pacf(series.dropna(), alpha=0.05) if plot_pacf else acf(series.dropna(), alpha=0.05)
lower_y = corr_array[1][:,0] - corr_array[0]
upper_y = corr_array[1][:,1] - corr_array[0]
fig = go.Figure()
    for x in range(len(corr_array[0])):
        fig.add_scatter(x=(x, x), y=(0, corr_array[0][x]), mode='lines', line_color='#3f3f3f')
fig.add_scatter(x=np.arange(len(corr_array[0])), y=corr_array[0], mode='markers', marker_color='#1f77b4',
marker_size=12)
fig.add_scatter(x=np.arange(len(corr_array[0])), y=upper_y, mode='lines', line_color='rgba(255,255,255,0)')
fig.add_scatter(x=np.arange(len(corr_array[0])), y=lower_y, mode='lines',fillcolor='rgba(32, 146, 230,0.3)',
fill='tonexty', line_color='rgba(255,255,255,0)')
fig.update_traces(showlegend=False)
fig.update_xaxes(range=[-1,42])
fig.update_yaxes(zerolinecolor='#000000')
fig.update_layout(title=title)
fig.show()
create_corr_plot(sec_returns, title="Autocorrelation of 1 second returns")
create_corr_plot(np.abs(sec_returns), title="Autocorrelation of 1 second absolute returns")
Note that the returns over the previous second have no clear influence on what will happen over the next second. However, the absolute returns (a measure of volatility) are persistent.
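This combination, uncorrelated signed returns but persistent absolute returns, is the classic volatility clustering effect. A self-contained sketch using a toy GARCH(1,1)-style simulation (parameters chosen for illustration, not fitted to the data):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy GARCH(1,1) simulation: volatility persists, signed returns do not.
n = 20_000
omega, alpha, beta = 0.1, 0.1, 0.85
returns = np.zeros(n)
var = omega / (1 - alpha - beta)  # start at the stationary variance
for i in range(1, n):
    var = omega + alpha * returns[i - 1] ** 2 + beta * var
    returns[i] = np.sqrt(var) * rng.standard_normal()

def lag1_autocorr(x):
    return np.corrcoef(x[:-1], x[1:])[0, 1]

# Signed returns look uncorrelated; absolute returns are positively autocorrelated.
print(f"corr(r_t, r_t+1)       = {lag1_autocorr(returns):.3f}")
print(f"corr(|r_t|, |r_t+1|)   = {lag1_autocorr(np.abs(returns)):.3f}")
```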
interarrival = pd.Series(aapl.index).diff().dropna().dt.total_seconds()*10**6
interarrival.name="microseconds"
len(interarrival)
400390
interarrival.sample(10000).isna().sum()
0
plt.figure(figsize=(20,10))
sns.histplot(interarrival.sample(10000), stat="probability", binwidth=30*10**3, binrange=(0,1.5*10**6))
plt.title("Distribution of interarrival times", fontsize=20)
plt.show()
plt.figure(figsize=(20,10))
sns.histplot(interarrival.sample(10000), stat="probability", binwidth=30*10**3, binrange=(0,1.5*10**6), log_scale=(False,True))
plt.title("Distribution of interarrival times (with log probability)", fontsize=20)
plt.show()
Note that even with a log y-axis, the relationship does not look linear, so the interarrival times do not appear exponentially distributed. We check this by plotting a Q-Q plot against the exponential distribution.
sample=interarrival.sample(10000)
sm.qqplot(interarrival[interarrival > 1000000], dist=stats.expon, line="s")
plt.show()
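A quick summary statistic that distinguishes Poisson arrivals from bursty ones is the coefficient of variation (std/mean) of the interarrival times: it equals 1 for an exponential distribution and is well above 1 for clustered order flow. A sketch on synthetic data (the real `interarrival` series could be substituted for `bursty_gaps`):

```python
import numpy as np

rng = np.random.default_rng(2)

# Exponential (Poisson process) interarrival times: CV = std / mean = 1.
poisson_gaps = rng.exponential(scale=1.0, size=100_000)

# Bursty arrivals: a mixture of many tiny gaps and occasional long lulls,
# a crude stand-in for order-flow clustering.
bursty_gaps = np.where(rng.random(100_000) < 0.9,
                       rng.exponential(0.1, 100_000),
                       rng.exponential(9.1, 100_000))

def cv(x):
    return x.std() / x.mean()

print(f"CV (Poisson): {cv(poisson_gaps):.2f}")
print(f"CV (bursty):  {cv(bursty_gaps):.2f}")
```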
Idea: fit a power law
sample_rounded = 25000 * round(sample/25000)
count = dict()
for micro in sample_rounded.unique():
count[micro]= len(sample_rounded[sample_rounded==micro])
count = pd.Series(count)
params = stats.powerlaw.fit(count)
sm.qqplot(interarrival[interarrival > 2000000], stats.powerlaw, distargs=(0.05,), line="r")
plt.show()
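Note that scipy's `powerlaw` distribution has bounded support, so for tail behaviour a common alternative is the Pareto maximum-likelihood (Hill) estimator of the tail exponent above a threshold. A sketch on synthetic Pareto data (the threshold `x_min` and exponent are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic heavy-tailed interarrival times: Pareto with tail exponent alpha = 1.5,
# sampled by inverting the CDF F(x) = 1 - (x_min / x)^alpha.
alpha_true = 1.5
x_min = 1.0
data = x_min * (1 - rng.random(50_000)) ** (-1 / alpha_true)

# Hill / Pareto MLE for the tail exponent above the threshold x_min.
tail = data[data > x_min]
alpha_hat = len(tail) / np.log(tail / x_min).sum()
print(f"estimated tail exponent: {alpha_hat:.2f}")
```

Applied to the empirical interarrival times above a chosen threshold, this gives a single number summarising how heavy the tail is.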
book=aapl
def add_filled_column(book: pd.DataFrame):
    df = book.copy()
    df["filled"] = False
    # Negate executed sizes so that an order's sizes sum to zero iff it was fully filled.
    df.loc[df["type"] == "execution_visible", "size"] *= -1
    summed_orders = df.groupby("external id")["size"].sum()
    filled_ids = summed_orders[summed_orders == 0].index
    df.loc[(df["external id"].isin(filled_ids)) & (df["type"] == "submission"), "filled"] = True
    # Restore the original sign of executed sizes.
    df.loc[df["type"] == "execution_visible", "size"] *= -1
    return df
submitted_orders = add_filled_column(book)
submitted_orders=submitted_orders[submitted_orders.type=="submission"]
proportion_filled = len(submitted_orders[submitted_orders.filled==True])/len(submitted_orders)
print(f"Only {round(100*proportion_filled,1)}% of submitted (visible) orders were filled")
Only 8.3% of submitted (visible) orders were filled
submitted_orders["midprice"] = (submitted_orders["ask price 0"]+submitted_orders["bid price 0"])/2
submitted_orders["distance to midprice"] = round(np.abs(submitted_orders["price"]-submitted_orders["midprice"]),3)
plt.figure(figsize=(20,10))
sns.histplot(submitted_orders["distance to midprice"], binwidth=0.01, binrange=(0,0.8), label="Submitted orders")
sns.histplot(submitted_orders.loc[submitted_orders.filled==True]["distance to midprice"], binrange=(0,0.8), binwidth=0.01, color="orange", label="Filled orders")
plt.legend(fontsize=16)
plt.title("Distribution of order placements and their associated fill probabilities", fontsize=25)
plt.show()
fill_probs = dict()
for distance in np.arange(0, 0.3, 0.005):
    at_distance = submitted_orders.loc[submitted_orders["distance to midprice"] == distance]
    if len(at_distance) == 0:
        fill_probs[distance / 0.05] = 0
    else:
        fill_probs[distance / 0.05] = len(at_distance.loc[at_distance["filled"] == True]) / len(at_distance)
fill_probs[0] = 1
fill_probs = pd.Series(fill_probs, name="ticks")
fill_probs.index.name="ticks"
fill_probs
ticks
0.0    1.000000
0.1    0.679715
0.2    0.536649
0.3    0.483374
0.4    0.448099
0.5    0.430788
0.6    0.407407
0.7    0.343992
0.8    0.333763
0.9    0.288292
1.0    0.269993
1.1    0.230419
1.2    0.212077
1.3    0.186023
1.4    0.164830
1.5    0.147965
1.6    0.137385
1.7    0.123291
1.8    0.098308
1.9    0.079900
2.0    0.083298
2.1    0.059554
2.2    0.052834
2.3    0.047713
2.4    0.042754
2.5    0.039494
2.6    0.029644
2.7    0.032106
2.8    0.024344
2.9    0.026111
3.0    0.017936
3.1    0.020004
3.2    0.015178
3.3    0.015107
3.4    0.016320
3.5    0.000000
3.6    0.016671
3.7    0.019388
3.8    0.014392
3.9    0.016074
4.0    0.015194
4.1    0.000000
4.2    0.019719
4.3    0.017912
4.4    0.019881
4.5    0.013514
4.6    0.019022
4.7    0.000000
4.8    0.028611
4.9    0.017354
5.0    0.017350
5.1    0.018839
5.2    0.018009
5.3    0.016299
5.4    0.025974
5.5    0.027668
5.6    0.022883
5.7    0.000000
5.8    0.021795
5.9    0.010695
Name: ticks, dtype: float64
px.line(fill_probs, title="Fill probabilities of orders placed in terms of distance from midprice")
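Market-making models in the Avellaneda-Stoikov tradition often summarise a curve like this with an exponential decay, $P(\text{fill} \mid \delta) \approx A e^{-k\delta}$. A hedged sketch of recovering $k$ and $A$ by linear regression on log-probabilities, using a synthetic noisy curve standing in for the empirical `fill_probs` (the parameters below are made up, not fitted to the AAPL data):

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical fill-probability curve: a noisy exponential decay in the
# distance-from-midprice delta (measured in ticks).
delta = np.arange(0.1, 3.0, 0.1)
k_true, A_true = 1.2, 0.9
probs = A_true * np.exp(-k_true * delta) * rng.uniform(0.9, 1.1, delta.size)

# log P = log A - k * delta, so a degree-1 polyfit recovers (-k, log A).
slope, intercept = np.polyfit(delta, np.log(probs), 1)
k_hat, A_hat = -slope, np.exp(intercept)
print(f"k ~ {k_hat:.2f}, A ~ {A_hat:.2f}")
```

Running the same regression on the positive entries of `fill_probs` would give an empirical fill-intensity parameter for this day of AAPL data.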
To be added